Tree-Based Models

Regression Tree

In [1]:
import pandas as pd
import numpy as np
In [2]:
Hitters=pd.read_csv('Hitters.csv')
Hitters.head()
Out[2]:
Unnamed: 0 AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits ... CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
0 -Andy Allanson 293 66 1 30 29 14 1 293 66 ... 30 29 14 A E 446 33 20 NaN A
1 -Alan Ashby 315 81 7 24 38 39 14 3449 835 ... 321 414 375 N W 632 43 10 475.0 N
2 -Alvin Davis 479 130 18 66 72 76 3 1624 457 ... 224 266 263 A W 880 82 14 480.0 A
3 -Andre Dawson 496 141 20 65 78 37 11 5628 1575 ... 828 838 354 N E 200 11 3 500.0 N
4 -Andres Galarraga 321 87 10 39 42 30 2 396 101 ... 48 46 33 N E 805 40 4 91.5 N

5 rows × 21 columns

In [3]:
Hitters.rename(columns = {'Unnamed: 0':'Name'}, inplace = True)
Hitters.head()
Out[3]:
Name AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits ... CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
0 -Andy Allanson 293 66 1 30 29 14 1 293 66 ... 30 29 14 A E 446 33 20 NaN A
1 -Alan Ashby 315 81 7 24 38 39 14 3449 835 ... 321 414 375 N W 632 43 10 475.0 N
2 -Alvin Davis 479 130 18 66 72 76 3 1624 457 ... 224 266 263 A W 880 82 14 480.0 A
3 -Andre Dawson 496 141 20 65 78 37 11 5628 1575 ... 828 838 354 N E 200 11 3 500.0 N
4 -Andres Galarraga 321 87 10 39 42 30 2 396 101 ... 48 46 33 N E 805 40 4 91.5 N

5 rows × 21 columns

The dataset contains missing values (for example, Salary is NaN in row 0), so we first drop the rows that have them.

In [4]:
Hitters = Hitters.dropna()
Hitters.head()
Out[4]:
Name AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits ... CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
1 -Alan Ashby 315 81 7 24 38 39 14 3449 835 ... 321 414 375 N W 632 43 10 475.0 N
2 -Alvin Davis 479 130 18 66 72 76 3 1624 457 ... 224 266 263 A W 880 82 14 480.0 A
3 -Andre Dawson 496 141 20 65 78 37 11 5628 1575 ... 828 838 354 N E 200 11 3 500.0 N
4 -Andres Galarraga 321 87 10 39 42 30 2 396 101 ... 48 46 33 N E 805 40 4 91.5 N
5 -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 ... 501 336 194 A W 282 421 25 750.0 A

5 rows × 21 columns

As indicated in the lecture notes, we model log-salary rather than salary, so we add a log-transformed column.

In [5]:
Hitters["log-salary"]=np.log(Hitters.Salary)
Hitters.head()
Out[5]:
Name AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits ... CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague log-salary
1 -Alan Ashby 315 81 7 24 38 39 14 3449 835 ... 414 375 N W 632 43 10 475.0 N 6.163315
2 -Alvin Davis 479 130 18 66 72 76 3 1624 457 ... 266 263 A W 880 82 14 480.0 A 6.173786
3 -Andre Dawson 496 141 20 65 78 37 11 5628 1575 ... 838 354 N E 200 11 3 500.0 N 6.214608
4 -Andres Galarraga 321 87 10 39 42 30 2 396 101 ... 46 33 N E 805 40 4 91.5 N 4.516339
5 -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 ... 336 194 A W 282 421 25 750.0 A 6.620073

5 rows × 22 columns

In [6]:
from sklearn.tree import DecisionTreeRegressor
In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    Hitters[["Hits", "Years"]], Hitters["log-salary"],
    train_size=int(0.8 * len(Hitters)), random_state=42)
print(len(X_train))
len(X_test)
210
Out[7]:
53
In [8]:
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
Out[8]:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

To visualize the tree, you need to install the following packages. You can do so with the commands below. After running them, you may want to restart the kernel so that the newly installed packages are picked up.

In [9]:
conda install pydot
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.
In [10]:
pip install six
Requirement already satisfied: six in /opt/anaconda3/lib/python3.7/site-packages (1.12.0)
Note: you may need to restart the kernel to use updated packages.

Restart the kernel here if you run the above code for the first time.

In [11]:
from io import StringIO  # replaces sklearn.externals.six, which is deprecated since 0.21 and removed in 0.23
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydot
In [12]:
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
(graph,) = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[12]:
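
If Graphviz and pydot are not available, a text rendering of the tree is one alternative. The sketch below is not part of the original notebook: it fits a shallow tree to synthetic data (the names X_demo, y_demo, small_tree and the data-generating formula are made up for illustration) and prints the split rules with sklearn.tree.export_text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the two predictors used above (assumption: NOT the Hitters data).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(0, 240, size=200),  # plays the role of "Hits"
    rng.integers(1, 25, size=200),   # plays the role of "Years"
])
y_demo = 4.5 + 0.005 * X_demo[:, 0] + 0.08 * X_demo[:, 1] + rng.normal(0, 0.3, size=200)

# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the split rules as an indented string.
rules = export_text(small_tree, feature_names=["Hits", "Years"])
print(rules)
```

The same call should work on the fitted dt, dt_02 and dt_05 objects, although the unpruned tree produces a very long listing.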

Clearly, this tree is very large, because by default it is grown without any pruning. We can set ccp_alpha to apply cost-complexity pruning; larger values prune more aggressively.

In [13]:
dt_02 = DecisionTreeRegressor(ccp_alpha=0.02)
dt_02.fit(X_train, y_train)
dot_data = StringIO()
export_graphviz(dt_02, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
(graph,) = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[13]:
In [14]:
dt_05 = DecisionTreeRegressor(ccp_alpha=0.05)
dt_05.fit(X_train, y_train)
dot_data = StringIO()
export_graphviz(dt_05, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
(graph,) = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[14]:
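
Tree size can also be checked numerically instead of visually. The following is a minimal sketch on synthetic data (X_demo, y_demo and the sizes dictionary are illustrative names, not from the notebook) that uses get_n_leaves and get_depth, both listed in the class documentation, to compare the same three ccp_alpha values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the Hitters predictors).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(0, 0.3, size=200)

sizes = {}
for alpha in [0.0, 0.02, 0.05]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X_demo, y_demo)
    # Record (number of leaves, depth) for each pruning strength.
    sizes[alpha] = (tree.get_n_leaves(), tree.get_depth())
    print(f"ccp_alpha={alpha}: leaves={sizes[alpha][0]}, depth={sizes[alpha][1]}")
```

The number of leaves should not increase as ccp_alpha grows, since each pruned tree is a subtree of the less-pruned one.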

Let's check the test MSE for these three trees.

In [15]:
from sklearn.metrics import mean_squared_error as MSE
In [16]:
MSE(y_test, dt_05.predict(X_test))
Out[16]:
0.36436899149434937
In [17]:
MSE(y_test, dt_02.predict(X_test))
Out[17]:
0.1982851067283483
In [18]:
MSE(y_test, dt.predict(X_test))
Out[18]:
0.6136008399792554
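
The ccp_alpha values above were picked by hand. One possible way to choose the value from the data, sketched here under assumptions, is to take candidate values from cost_complexity_pruning_path and compare them by cross-validation on the training set. The data are synthetic, and names such as X_demo, alphas and search are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: NOT the Hitters data).
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(0, 0.3, size=200)

# Effective alphas at which the fully grown tree loses a subtree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
# Guard against tiny negative values from floating-point error, then de-duplicate.
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))
# Keep roughly 20 candidates so the search stays fast.
alphas = alphas[:: max(1, len(alphas) // 20)]

# 5-fold cross-validation over the candidate alphas, scored by (negative) MSE.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ccp_alpha"]
print("best ccp_alpha:", best_alpha, "CV MSE:", -search.best_score_)
```

Choosing ccp_alpha this way keeps the test set for a final, unbiased estimate of the error.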

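Among the three trees, the one pruned with ccp_alpha=0.02 has the lowest test MSE (0.198), compared with 0.364 for ccp_alpha=0.05 and 0.614 for the unpruned tree. This is the trade-off in setting ccp_alpha: too little pruning overfits, too much underfits. You can also control the complexity through other parameters such as max_depth, min_samples_split, min_samples_leaf, and min_weight_fraction_leaf. To see what these parameters mean, call help on the class.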

In [19]:
help(DecisionTreeRegressor)
Help on class DecisionTreeRegressor in module sklearn.tree._classes:

class DecisionTreeRegressor(sklearn.base.RegressorMixin, BaseDecisionTree)
 |  DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort='deprecated', ccp_alpha=0.0)
 |  
 |  A decision tree regressor.
 |  
 |  Read more in the :ref:`User Guide <tree>`.
 |  
 |  Parameters
 |  ----------
 |  criterion : str, optional (default="mse")
 |      The function to measure the quality of a split. Supported criteria
 |      are "mse" for the mean squared error, which is equal to variance
 |      reduction as feature selection criterion and minimizes the L2 loss
 |      using the mean of each terminal node, "friedman_mse", which uses mean
 |      squared error with Friedman's improvement score for potential splits,
 |      and "mae" for the mean absolute error, which minimizes the L1 loss
 |      using the median of each terminal node.
 |  
 |      .. versionadded:: 0.18
 |         Mean Absolute Error (MAE) criterion.
 |  
 |  splitter : str, optional (default="best")
 |      The strategy used to choose the split at each node. Supported
 |      strategies are "best" to choose the best split and "random" to choose
 |      the best random split.
 |  
 |  max_depth : int or None, optional (default=None)
 |      The maximum depth of the tree. If None, then nodes are expanded until
 |      all leaves are pure or until all leaves contain less than
 |      min_samples_split samples.
 |  
 |  min_samples_split : int, float, optional (default=2)
 |      The minimum number of samples required to split an internal node:
 |  
 |      - If int, then consider `min_samples_split` as the minimum number.
 |      - If float, then `min_samples_split` is a fraction and
 |        `ceil(min_samples_split * n_samples)` are the minimum
 |        number of samples for each split.
 |  
 |      .. versionchanged:: 0.18
 |         Added float values for fractions.
 |  
 |  min_samples_leaf : int, float, optional (default=1)
 |      The minimum number of samples required to be at a leaf node.
 |      A split point at any depth will only be considered if it leaves at
 |      least ``min_samples_leaf`` training samples in each of the left and
 |      right branches.  This may have the effect of smoothing the model,
 |      especially in regression.
 |  
 |      - If int, then consider `min_samples_leaf` as the minimum number.
 |      - If float, then `min_samples_leaf` is a fraction and
 |        `ceil(min_samples_leaf * n_samples)` are the minimum
 |        number of samples for each node.
 |  
 |      .. versionchanged:: 0.18
 |         Added float values for fractions.
 |  
 |  min_weight_fraction_leaf : float, optional (default=0.)
 |      The minimum weighted fraction of the sum total of weights (of all
 |      the input samples) required to be at a leaf node. Samples have
 |      equal weight when sample_weight is not provided.
 |  
 |  max_features : int, float, str or None, optional (default=None)
 |      The number of features to consider when looking for the best split:
 |  
 |      - If int, then consider `max_features` features at each split.
 |      - If float, then `max_features` is a fraction and
 |        `int(max_features * n_features)` features are considered at each
 |        split.
 |      - If "auto", then `max_features=n_features`.
 |      - If "sqrt", then `max_features=sqrt(n_features)`.
 |      - If "log2", then `max_features=log2(n_features)`.
 |      - If None, then `max_features=n_features`.
 |  
 |      Note: the search for a split does not stop until at least one
 |      valid partition of the node samples is found, even if it requires to
 |      effectively inspect more than ``max_features`` features.
 |  
 |  random_state : int, RandomState instance or None, optional (default=None)
 |      If int, random_state is the seed used by the random number generator;
 |      If RandomState instance, random_state is the random number generator;
 |      If None, the random number generator is the RandomState instance used
 |      by `np.random`.
 |  
 |  max_leaf_nodes : int or None, optional (default=None)
 |      Grow a tree with ``max_leaf_nodes`` in best-first fashion.
 |      Best nodes are defined as relative reduction in impurity.
 |      If None then unlimited number of leaf nodes.
 |  
 |  min_impurity_decrease : float, optional (default=0.)
 |      A node will be split if this split induces a decrease of the impurity
 |      greater than or equal to this value.
 |  
 |      The weighted impurity decrease equation is the following::
 |  
 |          N_t / N * (impurity - N_t_R / N_t * right_impurity
 |                              - N_t_L / N_t * left_impurity)
 |  
 |      where ``N`` is the total number of samples, ``N_t`` is the number of
 |      samples at the current node, ``N_t_L`` is the number of samples in the
 |      left child, and ``N_t_R`` is the number of samples in the right child.
 |  
 |      ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
 |      if ``sample_weight`` is passed.
 |  
 |      .. versionadded:: 0.19
 |  
 |  min_impurity_split : float, (default=1e-7)
 |      Threshold for early stopping in tree growth. A node will split
 |      if its impurity is above the threshold, otherwise it is a leaf.
 |  
 |      .. deprecated:: 0.19
 |         ``min_impurity_split`` has been deprecated in favor of
 |         ``min_impurity_decrease`` in 0.19. The default value of
 |         ``min_impurity_split`` will change from 1e-7 to 0 in 0.23 and it
 |         will be removed in 0.25. Use ``min_impurity_decrease`` instead.
 |  
 |  presort : deprecated, default='deprecated'
 |      This parameter is deprecated and will be removed in v0.24.
 |  
 |      .. deprecated:: 0.22
 |  
 |  ccp_alpha : non-negative float, optional (default=0.0)
 |      Complexity parameter used for Minimal Cost-Complexity Pruning. The
 |      subtree with the largest cost complexity that is smaller than
 |      ``ccp_alpha`` will be chosen. By default, no pruning is performed. See
 |      :ref:`minimal_cost_complexity_pruning` for details.
 |  
 |      .. versionadded:: 0.22
 |  
 |  Attributes
 |  ----------
 |  feature_importances_ : ndarray of shape (n_features,)
 |      The feature importances.
 |      The higher, the more important the feature.
 |      The importance of a feature is computed as the
 |      (normalized) total reduction of the criterion brought
 |      by that feature. It is also known as the Gini importance [4]_.
 |  
 |  max_features_ : int,
 |      The inferred value of max_features.
 |  
 |  n_features_ : int
 |      The number of features when ``fit`` is performed.
 |  
 |  n_outputs_ : int
 |      The number of outputs when ``fit`` is performed.
 |  
 |  tree_ : Tree object
 |      The underlying Tree object. Please refer to
 |      ``help(sklearn.tree._tree.Tree)`` for attributes of Tree object and
 |      :ref:`sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py`
 |      for basic usage of these attributes.
 |  
 |  See Also
 |  --------
 |  DecisionTreeClassifier : A decision tree classifier.
 |  
 |  Notes
 |  -----
 |  The default values for the parameters controlling the size of the trees
 |  (e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
 |  unpruned trees which can potentially be very large on some data sets. To
 |  reduce memory consumption, the complexity and size of the trees should be
 |  controlled by setting those parameter values.
 |  
 |  The features are always randomly permuted at each split. Therefore,
 |  the best found split may vary, even with the same training data and
 |  ``max_features=n_features``, if the improvement of the criterion is
 |  identical for several splits enumerated during the search of the best
 |  split. To obtain a deterministic behaviour during fitting,
 |  ``random_state`` has to be fixed.
 |  
 |  References
 |  ----------
 |  
 |  .. [1] https://en.wikipedia.org/wiki/Decision_tree_learning
 |  
 |  .. [2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification
 |         and Regression Trees", Wadsworth, Belmont, CA, 1984.
 |  
 |  .. [3] T. Hastie, R. Tibshirani and J. Friedman. "Elements of Statistical
 |         Learning", Springer, 2009.
 |  
 |  .. [4] L. Breiman, and A. Cutler, "Random Forests",
 |         https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
 |  
 |  Examples
 |  --------
 |  >>> from sklearn.datasets import load_boston
 |  >>> from sklearn.model_selection import cross_val_score
 |  >>> from sklearn.tree import DecisionTreeRegressor
 |  >>> X, y = load_boston(return_X_y=True)
 |  >>> regressor = DecisionTreeRegressor(random_state=0)
 |  >>> cross_val_score(regressor, X, y, cv=10)
 |  ...                    # doctest: +SKIP
 |  ...
 |  array([ 0.61..., 0.57..., -0.34..., 0.41..., 0.75...,
 |          0.07..., 0.29..., 0.33..., -1.42..., -1.77...])
 |  
 |  Method resolution order:
 |      DecisionTreeRegressor
 |      sklearn.base.RegressorMixin
 |      BaseDecisionTree
 |      sklearn.base.MultiOutputMixin
 |      sklearn.base.BaseEstimator
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort='deprecated', ccp_alpha=0.0)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  fit(self, X, y, sample_weight=None, check_input=True, X_idx_sorted=None)
 |      Build a decision tree regressor from the training set (X, y).
 |      
 |      Parameters
 |      ----------
 |      X : {array-like or sparse matrix} of shape (n_samples, n_features)
 |          The training input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csc_matrix``.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          The target values (real numbers). Use ``dtype=np.float64`` and
 |          ``order='C'`` for maximum efficiency.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights. If None, then samples are equally weighted. Splits
 |          that would create child nodes with net zero or negative weight are
 |          ignored while searching for a split in each node.
 |      
 |      check_input : bool, (default=True)
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you do.
 |      
 |      X_idx_sorted : array-like of shape (n_samples, n_features), optional
 |          The indexes of the sorted training input samples. If many tree
 |          are grown on the same dataset, this allows the ordering to be
 |          cached between trees. If None, the data will be sorted here.
 |          Don't use this parameter unless you know what to do.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Fitted estimator.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  classes_
 |  
 |  n_classes_
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.RegressorMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Return the coefficient of determination R^2 of the prediction.
 |      
 |      The coefficient R^2 is defined as (1 - u/v), where u is the residual
 |      sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
 |      sum of squares ((y_true - y_true.mean()) ** 2).sum().
 |      The best possible score is 1.0 and it can be negative (because the
 |      model can be arbitrarily worse). A constant model that always
 |      predicts the expected value of y, disregarding the input features,
 |      would get a R^2 score of 0.0.
 |      
 |      Parameters
 |      ----------
 |      X : array-like of shape (n_samples, n_features)
 |          Test samples. For some estimators this may be a
 |          precomputed kernel matrix or a list of generic objects instead,
 |          shape = (n_samples, n_samples_fitted),
 |          where n_samples_fitted is the number of
 |          samples used in the fitting for the estimator.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          True values for X.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          R^2 of self.predict(X) wrt. y.
 |      
 |      Notes
 |      -----
 |      The R2 score used when calling ``score`` on a regressor will use
 |      ``multioutput='uniform_average'`` from version 0.23 to keep consistent
 |      with :func:`~sklearn.metrics.r2_score`. This will influence the
 |      ``score`` method of all the multioutput regressors (except for
 |      :class:`~sklearn.multioutput.MultiOutputRegressor`). To specify the
 |      default value manually and avoid the warning, please either call
 |      :func:`~sklearn.metrics.r2_score` directly or make a custom scorer with
 |      :func:`~sklearn.metrics.make_scorer` (the built-in scorer ``'r2'`` uses
 |      ``multioutput='uniform_average'``).
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.RegressorMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from BaseDecisionTree:
 |  
 |  apply(self, X, check_input=True)
 |      Return the index of the leaf that each sample is predicted as.
 |      
 |      .. versionadded:: 0.17
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : bool, (default=True)
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you do.
 |      
 |      Returns
 |      -------
 |      X_leaves : array_like, shape = [n_samples,]
 |          For each datapoint x in X, return the index of the leaf x
 |          ends up in. Leaves are numbered within
 |          ``[0; self.tree_.node_count)``, possibly with gaps in the
 |          numbering.
 |  
 |  cost_complexity_pruning_path(self, X, y, sample_weight=None)
 |      Compute the pruning path during Minimal Cost-Complexity Pruning.
 |      
 |      See :ref:`minimal_cost_complexity_pruning` for details on the pruning
 |      process.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The training input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csc_matrix``.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          The target values (class labels) as integers or strings.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights. If None, then samples are equally weighted. Splits
 |          that would create child nodes with net zero or negative weight are
 |          ignored while searching for a split in each node. Splits are also
 |          ignored if they would result in any single class carrying a
 |          negative weight in either child node.
 |      
 |      Returns
 |      -------
 |      ccp_path : Bunch
 |          Dictionary-like object, with attributes:
 |      
 |          ccp_alphas : ndarray
 |              Effective alphas of subtree during pruning.
 |      
 |          impurities : ndarray
 |              Sum of the impurities of the subtree leaves for the
 |              corresponding alpha value in ``ccp_alphas``.
 |  
 |  decision_path(self, X, check_input=True)
 |      Return the decision path in the tree.
 |      
 |      .. versionadded:: 0.18
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : bool, (default=True)
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you do.
 |      
 |      Returns
 |      -------
 |      indicator : sparse csr array, shape = [n_samples, n_nodes]
 |          Return a node indicator matrix where non zero elements
 |          indicates that the samples goes through the nodes.
 |  
 |  get_depth(self)
 |      Return the depth of the decision tree.
 |      
 |      The depth of a tree is the maximum distance between the root
 |      and any leaf.
 |      
 |      Returns
 |      -------
 |      self.tree_.max_depth : int
 |          The maximum depth of the tree.
 |  
 |  get_n_leaves(self)
 |      Return the number of leaves of the decision tree.
 |      
 |      Returns
 |      -------
 |      self.tree_.n_leaves : int
 |          Number of leaves.
 |  
 |  predict(self, X, check_input=True)
 |      Predict class or regression value for X.
 |      
 |      For a classification model, the predicted class for each sample in X is
 |      returned. For a regression model, the predicted value based on X is
 |      returned.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape (n_samples, n_features)
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : bool, (default=True)
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you do.
 |      
 |      Returns
 |      -------
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          The predicted classes, or the predict values.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from BaseDecisionTree:
 |  
 |  feature_importances_
 |      Return the feature importances.
 |      
 |      The importance of a feature is computed as the (normalized) total
 |      reduction of the criterion brought by that feature.
 |      It is also known as the Gini importance.
 |      
 |      Returns
 |      -------
 |      feature_importances_ : array, shape = [n_features]
 |          Normalized total reduction of critera by feature (Gini importance).
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Estimator instance.

The feature_importances_ attribute of a fitted tree gives the importance of each feature, in the column order of X_train (here Hits, then Years).

In [20]:
dt_02.feature_importances_
Out[20]:
array([0.24781281, 0.75218719])

Classification Tree

In [21]:
Heart=pd.read_csv('Heart.csv',index_col=0)
Heart = Heart.dropna()
Heart.head()
Out[21]:
Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca Thal AHD
1 63 1 typical 145 233 1 2 150 0 2.3 3 0.0 fixed No
2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3.0 normal Yes
3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2.0 reversable Yes
4 37 1 nonanginal 130 250 0 0 187 0 3.5 3 0.0 normal No
5 41 0 nontypical 130 204 0 2 172 0 1.4 1 0.0 normal No
In [22]:
from sklearn.tree import DecisionTreeClassifier
In [23]:
X=Heart.drop('AHD', axis=1)
X.head()
Out[23]:
Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca Thal
1 63 1 typical 145 233 1 2 150 0 2.3 3 0.0 fixed
2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3.0 normal
3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2.0 reversable
4 37 1 nonanginal 130 250 0 0 187 0 3.5 3 0.0 normal
5 41 0 nontypical 130 204 0 2 172 0 1.4 1 0.0 normal
In [24]:
X=pd.get_dummies(X)
X.head()
Out[24]:
Age Sex RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca ChestPain_asymptomatic ChestPain_nonanginal ChestPain_nontypical ChestPain_typical Thal_fixed Thal_normal Thal_reversable
1 63 1 145 233 1 2 150 0 2.3 3 0.0 0 0 0 1 1 0 0
2 67 1 160 286 0 2 108 1 1.5 2 3.0 1 0 0 0 0 1 0
3 67 1 120 229 0 2 129 1 2.6 2 2.0 1 0 0 0 0 0 1
4 37 1 130 250 0 0 187 0 3.5 3 0.0 0 1 0 0 0 1 0
5 41 0 130 204 0 2 172 0 1.4 1 0.0 0 0 1 0 0 1 0
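
To make the encoding step concrete, here is a small sketch of what pd.get_dummies does to a string column. The toy frame reuses three Age and ChestPain values from the rows shown above; the names toy and encoded are illustrative.

```python
import pandas as pd

# Three rows taken from the Heart data shown above (Age and ChestPain only).
toy = pd.DataFrame({
    "Age": [63, 67, 37],
    "ChestPain": ["typical", "asymptomatic", "nonanginal"],
})

# Numeric columns are kept; each string column becomes one indicator column per level.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Every row has exactly one active ChestPain indicator.
print(encoded.filter(like="ChestPain_").sum(axis=1).tolist())
```

Depending on the pandas version, the indicator columns are stored as 0/1 integers or as booleans; scikit-learn's tree models accept either.
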
In [25]:
y=Heart.AHD
In [26]:
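# NOTE: len(Hitters) looks like a slip for len(Heart); as written this gives 210 training and 87 test rows.
XX_train, XX_test, yy_train, yy_test = train_test_split(X, y, train_size=int(0.8*len(Hitters)), random_state=42)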
In [27]:
dtc02 = DecisionTreeClassifier(ccp_alpha=0.02)
dtc02.fit(XX_train,yy_train)
Out[27]:
DecisionTreeClassifier(ccp_alpha=0.02, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
In [28]:
dot_data = StringIO()
export_graphviz(dtc02, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
(graph,) = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[28]:
In [29]:
dtc02.classes_
Out[29]:
array(['No', 'Yes'], dtype=object)

The classes_ attribute shows that the value entries in each node of the plotted tree correspond to the classes in the order ['No', 'Yes']. We can also get the importance of each feature with the following code.

In [30]:
dtc02.feature_importances_
Out[30]:
array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.62890992, 0.17699649, 0.        , 0.        , 0.        ,
       0.        , 0.19409359, 0.        ])
In [31]:
np.argmax(dtc02.feature_importances_)
Out[31]:
10
In [32]:
XX_train.columns
Out[32]:
Index(['Age', 'Sex', 'RestBP', 'Chol', 'Fbs', 'RestECG', 'MaxHR', 'ExAng',
       'Oldpeak', 'Slope', 'Ca', 'ChestPain_asymptomatic',
       'ChestPain_nonanginal', 'ChestPain_nontypical', 'ChestPain_typical',
       'Thal_fixed', 'Thal_normal', 'Thal_reversable'],
      dtype='object')
In [33]:
XX_train.columns[10]
Out[33]:
'Ca'
In [34]:
dtc02.predict(XX_test)
Out[34]:
array(['No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No',
       'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No',
       'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No',
       'Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes',
       'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No',
       'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes',
       'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes',
       'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No',
       'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes'], dtype=object)
In [35]:
from sklearn.metrics import confusion_matrix

# Transposed, so rows are predicted classes and columns are true classes.
cm = confusion_matrix(y_true=yy_test, y_pred=dtc02.predict(XX_test)).T
cm
Out[35]:
array([[40,  9],
       [ 8, 30]])
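
Because the matrix was transposed, its rows are predicted classes and its columns are true classes, both in the order ['No', 'Yes']. The short sketch below copies the numbers from Out[35] and derives accuracy, sensitivity and specificity, treating 'Yes' as the positive class (that choice is an assumption made here for illustration).

```python
import numpy as np

# Transposed confusion matrix copied from Out[35]:
# rows = predicted class, columns = true class, order ['No', 'Yes'].
cm = np.array([[40, 9],
               [8, 30]])

tn, fn = cm[0]  # predicted 'No':  true 'No' (TN), true 'Yes' (FN)
fp, tp = cm[1]  # predicted 'Yes': true 'No' (FP), true 'Yes' (TP)

accuracy = (tp + tn) / cm.sum()   # share of all test cases predicted correctly
sensitivity = tp / (tp + fn)      # share of true 'Yes' cases that were found
specificity = tn / (tn + fp)      # share of true 'No' cases that were found

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```
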
In [36]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_true= yy_test,y_score = dtc02.predict_proba(XX_test)[:,1])
Out[36]:
0.8632478632478633
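
The AUC can be read as the probability that a randomly chosen 'Yes' case receives a higher predicted probability than a randomly chosen 'No' case, with ties counted as one half. The sketch below checks this on six made-up scores (y_true and p_yes are invented for illustration) against roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of 'Yes' (illustrative only).
y_true = np.array(["No", "No", "Yes", "No", "Yes", "Yes"])
p_yes = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = p_yes[y_true == "Yes"]
neg = p_yes[y_true == "No"]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
auc_manual = ((diff > 0) + 0.5 * (diff == 0)).mean()

auc_sklearn = roc_auc_score(y_true, p_yes)
print(auc_manual, auc_sklearn)
```

Here 6 of the 9 positive/negative pairs are ranked correctly, so both computations give 2/3.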

Bagging and Random Forest

The syntax for random forests is similar to that for decision trees. With max_features=None, every split considers all features, which is bagging; with a smaller value it is a random forest. In the scikit-learn version used here, the default for the classifier is max_features='auto', which considers sqrt(n_features) features at each split.

In [37]:
from sklearn.ensemble import RandomForestClassifier
In [38]:
Tree_bag=RandomForestClassifier(n_estimators=1000, max_features=None,random_state=42)
Tree_bag.fit(XX_train,yy_train)
Out[38]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features=None,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
In [39]:
roc_auc_score(y_true= yy_test,y_score = Tree_bag.predict_proba(XX_test)[:,1])
Out[39]:
0.9134615384615385
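
The difference between bagging and a random forest can be seen directly on the fitted trees. The following is a minimal sketch on synthetic data from make_classification (not the Heart data; X_demo, y_demo, bagging and forest are illustrative names). It also sets oob_score=True, one of the constructor options shown in the output above, to get an out-of-bag accuracy estimate without a test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 18 features, matching the width of the encoded Heart data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=18, n_informative=6, random_state=0)

# Bagging: every split may use all 18 features.
bagging = RandomForestClassifier(
    n_estimators=200, max_features=None, oob_score=True, random_state=0
).fit(X_demo, y_demo)

# Random forest: every split draws a random subset of about sqrt(18) features.
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", oob_score=True, random_state=0
).fit(X_demo, y_demo)

# Number of features considered per split in the individual trees.
print(bagging.estimators_[0].max_features_, forest.estimators_[0].max_features_)
# Out-of-bag accuracy of each ensemble.
print(bagging.oob_score_, forest.oob_score_)
```

"sqrt" is written out here because it is accepted across scikit-learn versions, whereas the meaning and availability of "auto" depend on the version.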

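We can tune the parameters to improve the performance. Here min_samples_split=0.2 means that a node is split only if it holds at least 20% of the training samples.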

In [40]:
Tree_bag=RandomForestClassifier(n_estimators=1000, max_features=None, min_samples_split = 0.2,random_state=42)
Tree_bag.fit(XX_train,yy_train)
roc_auc_score(y_true= yy_test,y_score = Tree_bag.predict_proba(XX_test)[:,1])
Out[40]:
0.922008547008547
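
A float min_samples_split is turned into a sample count with ceil(min_samples_split * n_samples), as the help text above states. The sketch below works this out for the 210 training rows used here; the helper name effective_min_samples_split is hypothetical, and the floor of 2 is an assumption reflecting that a split needs at least two samples.

```python
import math

n_train = 210  # number of training rows produced by the split above

def effective_min_samples_split(fraction, n_samples):
    """Sample count implied by a float min_samples_split (hypothetical helper)."""
    return max(2, math.ceil(fraction * n_samples))

# 0.2 is the value used for bagging above; 0.01 is used for the random forest below.
bagging_threshold = effective_min_samples_split(0.2, n_train)
forest_threshold = effective_min_samples_split(0.01, n_train)
print(bagging_threshold, forest_threshold)
```

So with 0.2 a node needs at least 42 samples to be split, which gives much shallower trees than the default of 2.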

How about a random forest, which uses the default max_features?

In [41]:
RF=RandomForestClassifier(n_estimators=1000,min_samples_split = 0.01,  random_state=42)
RF.fit(XX_train,yy_train)
roc_auc_score(y_true= yy_test,y_score = RF.predict_proba(XX_test)[:,1])
Out[41]:
0.9460470085470085
In [42]:
RF.feature_importances_
Out[42]:
array([0.09282559, 0.03011241, 0.07373566, 0.08469478, 0.00989072,
       0.01738844, 0.11910595, 0.0355567 , 0.09458349, 0.02661849,
       0.13979625, 0.07420234, 0.03116124, 0.00849727, 0.0115093 ,
       0.00514748, 0.08358388, 0.06159002])
In [43]:
np.argmax(RF.feature_importances_)
Out[43]:
10
In [44]:
for i in range(len(XX_train.columns)):
    print(XX_train.columns[i],RF.feature_importances_[i])
Age 0.09282559386654922
Sex 0.03011241023696509
RestBP 0.07373565622391116
Chol 0.08469477727623136
Fbs 0.00989071879312888
RestECG 0.017388436506987185
MaxHR 0.11910595258383021
ExAng 0.035556703499633684
Oldpeak 0.09458349397285958
Slope 0.026618491453248256
Ca 0.13979624705429416
ChestPain_asymptomatic 0.07420233853735006
ChestPain_nonanginal 0.03116123913772496
ChestPain_nontypical 0.008497267062851758
ChestPain_typical 0.011509302795558604
Thal_fixed 0.0051474757668333515
Thal_normal 0.08358387785443651
Thal_reversable 0.06159001737760581
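
A sorted view is often easier to read than the loop output. The sketch below copies the importances from Out[42] into a pandas Series; with the fitted model available, pd.Series(RF.feature_importances_, index=XX_train.columns) would give the same object. Summing the indicator columns back to their original variable is an extra step added here for illustration, not something the notebook does.

```python
import pandas as pd

# Importances copied from Out[42], labelled with the column names from Out[32].
importances = pd.Series({
    "Age": 0.09282559, "Sex": 0.03011241, "RestBP": 0.07373566,
    "Chol": 0.08469478, "Fbs": 0.00989072, "RestECG": 0.01738844,
    "MaxHR": 0.11910595, "ExAng": 0.0355567, "Oldpeak": 0.09458349,
    "Slope": 0.02661849, "Ca": 0.13979625,
    "ChestPain_asymptomatic": 0.07420234, "ChestPain_nonanginal": 0.03116124,
    "ChestPain_nontypical": 0.00849727, "ChestPain_typical": 0.0115093,
    "Thal_fixed": 0.00514748, "Thal_normal": 0.08358388,
    "Thal_reversable": 0.06159002,
})

# Five most important columns.
top5 = importances.sort_values(ascending=False).head(5)
print(top5)

# Add the indicator columns of each categorical variable back together.
grouped = importances.groupby(lambda name: name.split("_")[0]).sum()
print(grouped.sort_values(ascending=False).head(4))
```

Ca is the single most important column, but once its three indicator columns are added together, Thal has the largest total.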

Gradient Boosting

In [45]:
from  sklearn.ensemble import GradientBoostingClassifier
In [46]:
gbc = GradientBoostingClassifier(random_state=31)
gbc.fit(XX_train,yy_train)
roc_auc_score(y_true= yy_test,y_score = gbc.predict_proba(XX_test)[:,1])
Out[46]:
0.8883547008547008
In [47]:
gbc = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1000,max_depth=1, max_features=2, random_state=31)
gbc.fit(XX_train,yy_train)
roc_auc_score(y_true= yy_test,y_score = gbc.predict_proba(XX_test)[:,1])
Out[47]:
0.9540598290598291

With carefully chosen parameters, gradient boosting can work very well (here the test AUC rises from 0.888 to 0.954), but it is quite sensitive to the parameter values.
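
One way to see this sensitivity is to track the AUC after every boosting stage. The sketch below is an illustration on synthetic data, not the Heart data: it holds out a validation set (an assumption made here so that the test set is not used for tuning) and uses staged_predict_proba to score the model as trees are added. The names X_demo, gb, staged_auc and best_n are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=18, n_informative=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Small learning rate with many stumps, in the spirit of the second model above.
gb = GradientBoostingClassifier(
    learning_rate=0.05, n_estimators=300, max_depth=1, random_state=0
).fit(X_tr, y_tr)

# staged_predict_proba yields the predicted probabilities after each stage.
staged_auc = [roc_auc_score(y_val, proba[:, 1])
              for proba in gb.staged_predict_proba(X_val)]

best_n = int(np.argmax(staged_auc)) + 1  # stages are counted from 1
print("best number of trees:", best_n, "validation AUC:", max(staged_auc))
```

Repeating this for a few learning rates shows how the best number of trees shifts as the learning rate changes.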